
Conversation

@wooway777 (Collaborator)

resolves #900
resolves #846
includes #859

Iluvatar (天数): [test result screenshots]

MetaX (沐曦): [test result screenshot]

Moore Threads (摩尔): [test result screenshot]

…graph recording

- Ensure embedding tensors are on the same device. Change format.
- Optimize embedding kernel with vectorized memory access and __ldg
- Add vectorized memory access using float4/float2, half2, and bfloat162
- Use __ldg instruction for read-only weight and indices access
- Add memory alignment checks to enable vectorized paths
- Add __restrict__ keywords for better compiler optimization
- Implement dynamic block size selection based on embedding_dim (a sketch of these optimizations follows below)
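
A minimal CUDA sketch of the techniques these commits describe, assuming float32 weights and a one-block-per-token launch; the kernel and launcher names are hypothetical stand-ins, and the real implementation also carries half2 and bfloat162 paths plus explicit alignment checks:

```cuda
// Hypothetical sketch, not the PR's actual code: one block per token,
// threads stride across the row, float4 fast path with scalar fallback.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void embeddingGatherKernel(float *__restrict__ out,
                                      const float *__restrict__ weight,
                                      const int64_t *__restrict__ indices,
                                      int embedding_dim) {
    const int64_t row = __ldg(&indices[blockIdx.x]); // read-only index load
    const float *src = weight + row * embedding_dim;
    float *dst = out + static_cast<int64_t>(blockIdx.x) * embedding_dim;

    // float4 path: valid when the dim is a multiple of 4 and rows stay
    // 16-byte aligned (cudaMalloc'd weights with dim % 4 == 0 qualify).
    if (embedding_dim % 4 == 0) {
        const float4 *src4 = reinterpret_cast<const float4 *>(src);
        float4 *dst4 = reinterpret_cast<float4 *>(dst);
        for (int i = threadIdx.x; i < embedding_dim / 4; i += blockDim.x) {
            dst4[i] = __ldg(&src4[i]); // one 16-byte read-only transaction
        }
    } else {
        for (int i = threadIdx.x; i < embedding_dim; i += blockDim.x) {
            dst[i] = __ldg(&src[i]); // scalar fallback
        }
    }
}

// Dynamic block size keyed on embedding_dim, echoing the last bullet above.
void launchEmbeddingGather(float *out, const float *weight,
                           const int64_t *indices, int num_indices,
                           int embedding_dim, cudaStream_t stream) {
    int block = 256;                              // default
    if (embedding_dim <= 64)        block = 64;   // tiny rows: fewer idle threads
    else if (embedding_dim >= 1024) block = 128;  // large rows: less register pressure
    embeddingGatherKernel<<<num_indices, block, 0, stream>>>(
        out, weight, indices, embedding_dim);
}
```

The `__ldg` loads route the read-only weight and index accesses through the read-only data cache, and the `float4` path issues one 16-byte transaction in place of four 4-byte ones.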

Copilot AI left a comment


Pull request overview

This pull request adds device-side embedding support for CUDA and CUDA-like platforms (NVIDIA, Metax, Moore) to enable CUDA Graph recording. The key improvement is removing synchronous CPU transfers that previously prevented graph recording, replacing them with fully asynchronous device-side kernel implementations.

Key changes:

  • Implements device-side embedding kernels for NVIDIA, Metax, and Moore platforms with optimized vectorized memory access
  • Removes synchronous to(cpu_device) operations from test and production code (see the capture sketch after this list)
  • Adds comprehensive test suite for validating CUDA Graph recording support
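
Why the synchronous copies had to go: stream capture records work without executing it, so any call that forces host/device synchronization (a blocking cudaMemcpy, a to(cpu_device) round-trip) invalidates the capture, while plain asynchronous kernel launches record cleanly. A self-contained sketch, with a hypothetical kernel and the CUDA 12 `cudaGraphInstantiate` signature:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyEmbeddingKernel(float *out) { out[threadIdx.x] = 1.0f; }

int main() {
    float *buf;
    cudaMalloc(&buf, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // Legal during capture: asynchronous, device-side work only.
    dummyEmbeddingKernel<<<1, 256, 0, stream>>>(buf);
    // A synchronous cudaMemcpy(..., cudaMemcpyDeviceToHost) here would
    // invalidate the capture and cudaStreamEndCapture would fail.
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0); // CUDA 12 signature
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    printf("graph captured and replayed\n");

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(buf);
    return 0;
}
```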

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.

Show a summary per file:

- test/infinicore/ops/embedding.py: Removed CPU conversion logic; now supports device-side input directly
- test/infinicore/nn/embedding.py: Removed CPU conversion logic for testing
- test/infinicore/nn/test_embedding_graph_recording.py: New comprehensive test suite for CUDA Graph recording validation
- test/infinicore/nn/HOW_TO_USE_GRAPH_RECORDING_TEST.md: Documentation on how to use and interpret the graph recording tests
- test/infinicore/nn/EMBEDDING_GRAPH_RECORDING_COMPARISON.md: Technical comparison of before/after implementation details
- src/infiniop/ops/embedding/operator.cc: Multi-platform dispatcher for embedding operations
- src/infiniop/ops/embedding/embedding.h: Common descriptor macro definition
- src/infiniop/ops/embedding/cpu/embedding_cpu.h: CPU implementation header
- src/infiniop/ops/embedding/cpu/embedding_cpu.cc: CPU implementation with memcpy-based embedding lookup
- src/infiniop/ops/embedding/cuda/embedding_kernel.cuh: Shared CUDA kernel helper functions with vectorized memory access
- src/infiniop/ops/embedding/nvidia/embedding_nvidia.cuh: NVIDIA platform header
- src/infiniop/ops/embedding/nvidia/embedding_nvidia.cu: NVIDIA CUDA kernel implementation with float4/half2/bfloat162 optimization
- src/infiniop/ops/embedding/metax/embedding_metax.cuh: Metax platform header
- src/infiniop/ops/embedding/metax/embedding_metax.maca: Metax MACA kernel implementation
- src/infiniop/ops/embedding/moore/embedding_moore.h: Moore platform header
- src/infiniop/ops/embedding/moore/embedding_moore_kernel.h: Moore-specific kernel helpers
- src/infiniop/ops/embedding/moore/embedding_moore.mu: Moore MUSA kernel implementation
- src/infinicore/ops/embedding/embedding_infiniop.cc: InfiniOP wrapper with descriptor caching
- src/infinicore/ops/embedding/embedding.cc: Op dispatcher and device synchronization logic
- src/infinicore/nn/embedding.cc: Simplified forward method using new device-side ops
- python/infinicore/nn/functional/embedding.py: Removed CPU-only assertion to support device inputs
- include/infiniop/ops/embedding.h: Public API for embedding operations
- include/infiniop.h: Added embedding.h include
- include/infinicore/ops/embedding.hpp: Added Embedding class with dispatcher pattern
- include/infinicore/ops.hpp: Added embedding.hpp include


Comment on lines +48 to +54
} else if constexpr (std::is_same_v<T, cuda_bfloat16>) {
    // Use bfloat162 for vectorized access
    if (embedding_dim >= 2 && embedding_dim % 2 == 0) {
        copyVectorizedBFloat162<IndexType>(dst, src, embedding_dim);
    } else {
        copyScalar<T, IndexType>(dst, src, embedding_dim);
    }

Copilot AI Jan 9, 2026


The embeddingKernel template checks for the cuda_bfloat16 type on line 48, but this is Metax platform code, which should use __hpcc_bfloat16 instead. Because of this type mismatch, the vectorized bfloat16 path (lines 48-54) will never execute on the Metax platform. The kernel should check for the Metax-specific type __hpcc_bfloat16 to enable proper vectorization.
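
A sketch of the shape of the fix, compilable under nvcc; `__MACA__` and `__hpcc_bfloat16`/`__hpcc_bfloat162` are assumptions about the Metax toolchain taken from the comment, and the helper name is hypothetical:

```cuda
#include <cuda_bf16.h>
#include <type_traits>

// The review's point: dispatch on the platform's own bf16 type. On NVIDIA
// that is __nv_bfloat16; on Metax it would be __hpcc_bfloat16 (assumed
// here, guarded by an assumed __MACA__ toolchain macro).
#if defined(__MACA__)
using platform_bf16 = __hpcc_bfloat16;
using platform_bf16x2 = __hpcc_bfloat162;
#else
using platform_bf16 = __nv_bfloat16;
using platform_bf16x2 = __nv_bfloat162;
#endif

template <typename T>
__device__ void copyRow(T *__restrict__ dst, const T *__restrict__ src,
                        int embedding_dim) {
    if constexpr (std::is_same_v<T, platform_bf16>) {
        if (embedding_dim % 2 == 0) { // vectorized 2-wide path
            auto *d2 = reinterpret_cast<platform_bf16x2 *>(dst);
            auto *s2 = reinterpret_cast<const platform_bf16x2 *>(src);
            for (int i = threadIdx.x; i < embedding_dim / 2; i += blockDim.x)
                d2[i] = s2[i];
            return;
        }
    }
    for (int i = threadIdx.x; i < embedding_dim; i += blockDim.x) // scalar fallback
        dst[i] = src[i];
}
```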

Comment on lines +88 to +146
} else if (_embedding_dim >= 1024) {
    block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
} else if (_embedding_dim <= 256) {
    block_size = 384; // Medium embedding_dim: balanced configuration

Copilot AI Jan 9, 2026


The block size adjustment logic has an inconsistency. When _embedding_dim <= 256, the code sets block_size = 384 (line 91), but this branch is only reached when _embedding_dim is greater than 64 (line 87 already handles <= 64), so embedding_dim values between 65 and 256 get block_size = 384. That contradicts the default of 256 that was already set. The logic should be restructured so the conditions no longer overlap and each range selects a single block size.

Suggested change:

- } else if (_embedding_dim >= 1024) {
-     block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
- } else if (_embedding_dim <= 256) {
-     block_size = 384; // Medium embedding_dim: balanced configuration
+ } else if (_embedding_dim <= 256) {
+     block_size = 384; // Medium embedding_dim: balanced configuration
+ } else if (_embedding_dim >= 1024) {
+     block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
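
Read as a whole, the selection can be restructured as a single non-overlapping ladder, ordered so each range is hit exactly once. A sketch: the function name is hypothetical, the thresholds come from the quoted code, and the value for the elided <= 64 branch (referenced on line 87 but not shown in the excerpt) is a placeholder:

```cuda
// Hypothetical helper; thresholds from the quoted code, <= 64 value assumed.
static int selectBlockSize(int embedding_dim) {
    if (embedding_dim <= 64)  return 64;   // placeholder for the elided small-dim branch
    if (embedding_dim <= 256) return 384;  // medium embedding_dim: balanced configuration
    if (embedding_dim < 1024) return 256;  // former default
    return 128;                            // large embedding_dim: reduce register pressure
}
```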

print("⚠ CUDA 不可用,跳过图录制测试")
return False

device = infinicore.device("cuda", 0)

Copilot AI Jan 9, 2026

print("⚠ CUDA 不可用,跳过测试")
return False

device = infinicore.device("cuda", 0)

Copilot AI Jan 9, 2026

@wooway777 force-pushed the issue/900-temp branch 2 times, most recently from 98a0be1 to 7997d89 on January 9, 2026 02:40


Development

Successfully merging this pull request may close these issues.

[DEV] CUDA-like embedding (类CUDA embedding)
[DEV] Add an embedding operator that supports graph recording (增加支持图录制的embedding算子)
